Wildfires in the US

Statistical computation and visualization (MATH-517)

Zineb AGNAOU https://github.com/ZinebAg , Fahim BECK https://github.com/FahimBeck , Salima JAOUA https://github.com/salimajaoua , Matias JANVIN https://github.com/matiasjanvin , Seorim PARK https://github.com/seorimpark
November 19, 2021

Introduction

Wildfires are uncontrolled fires that burn in wildland vegetation, often in rural areas. They are not limited to a particular continent or environment, and they have burned different kinds of ecosystems on Earth for hundreds of millions of years (André Gabrielli (2019)). Wildfires are a concern all over the world, together with climate change and the preservation of nature and ecosystems. There have been several major wildfires recently, growing in number and severity: for instance, the 2020 California wildfires became one of the largest wildfire seasons in California history (Holly Yan, Cheri Mossburg, Artemis Moshtaghian and Paul Vercammen (2020)), with several million acres burnt (Topher Gauk-Roger, Stella Chan, Jason Hanna and Steve Almasy (2020)). Turkey went through its worst wildfire season in July and August 2021 (Mert Ozkan and Ezgi Erkoyun (2021)), and the 2019-2020 bushfires in Australia (also known as the Black Summer) killed several billion animals. Many of them belonged to endangered species, some of which were believed to have been driven to extinction by this event (Michael Slezak (2020)). Many countries therefore aim to minimize the size and the number of wildfires, since they can cause many direct and indirect human fatalities (Steven Reinberg (2021)), as well as air pollution (Sarah Gibbens (2021)) and the loss of ecosystems and biodiversity (as in the case of the Black Summer).

In order to reduce the number and the severity of wildfires, it is necessary to understand the main factors at the origin of the catastrophe. Wildfires are very often started accidentally (burning debris, agricultural activities, campfires, smoking) or intentionally (arson, children). Although the latter case can be prevented, human or non-human accidents can always happen and are hard to predict (Wildfire Causes, n.d.). However, we suspect that certain natural conditions make such accidents more likely and help the resulting fires grow larger. By identifying these conditions, states can build a strategy to suppress a wildfire efficiently once it starts, and prepare in advance for the locations that are most likely to catch fire at a given period of the year.

In addition, these factors change over time and create better or worse conditions for wildfires. For instance, global warming is strongly suspected to be one of the main reasons why wildfires have become more frequent in recent years (Alejandra Borunda (2021)). Countries have been undergoing climate change, and the resulting unexpected events may be at the root of the recent disasters.

Research questions

The purpose of this investigation is to answer the following question:

What are the main factors that affect the propagation of wildfires within the United States?

The investigation consists in answering these subquestions:

  1. How has the number of fires in the United States of America evolved from 1993 to 2015?

  2. How do the land covers vary over time?

  3. How are fires distributed across the land covers and meteorological factors?

Approaches

To answer the above questions we proceed as follows. We start with a descriptive and visual approach, producing interactive plots that present the data along different dimensions: geographically, temporally and by factor. We then study the distribution of land covers, including how the land covers vary at a given location, since changes were noticed in some areas. Next, we analyze the number of fires in relation to the land covers and the meteorological parameters. Since the correlation between parameters turned out to be low, we finally work on subsets and perform a quantile regression.

Sources of information / datasets

To perform this analysis, a dataset for the United States from 1993 to 2015 will be used. It contains 563,983 rows with 37 columns. The columns are the following:

Please note that the area proportions \(lc1\) to \(lc18\) do not always sum to exactly 1 for each pixel and month since a few classes with quasi-0 proportion have been removed.

Since the original data was provided for a prediction competition with the University of Edinburgh, there are 8,000 missing values in each of the \(CNT\) and \(BA\) columns. The missing values are not necessarily located in the same rows for the two features.

Exploratory Data Analysis

When considering only rows without missing values, 452,930 rows remain.

Table (1): Summary of features CNT, BA and the sum of lcs by row
Statistic     Min     Pctl(25)   Median   Pctl(75)   Mean      Max
CNT           0       0          0        2          2.280     359
BA            0       0          0        1.6        158.898   538,054
sum_of_lcs    0.822   0.997      0.999    1.000      0.997     1.000

As shown in Table (1), wildfires remain relatively rare events. At least 75% of the observations (one location in one month) have at most two fires, as seen from the third quartile of \(CNT\). The same applies to \(BA\), the aggregated burnt area, whose distribution is strongly positively skewed.

As stated in the data description, the proportions of the 18 land covers do not always add up to one. Looking at Figure 1 and Table (1), we can see that the minimum value of the sum is 0.82. The first quartile also shows that only 25% of the rows have a sum below approximately 0.997. We therefore continue with the data as is, considering the sums close enough to one.

Figure 1: Histogram representing the sum of land covers by row from 1993 to 2015. The histogram is negatively skewed

Figure 2: Distribution of land covers from 1993 to 2015

Figure 2 displays the distribution of the land covers using boxplots. Most land covers represent less than 10% of the location considered, which shows that the areas considered are diverse.

Some transformations were applied to the features. First, the temperature was converted from Kelvin to Celsius. Next, the U-component of wind (the wind speed in the eastward direction) and the V-component of wind (the wind speed in the northward direction) were aggregated using the Euclidean norm of the vector: \[W\hspace{-2pt}speed=\sqrt{{W\hspace{-2pt}speed_{East}}^2 + {W\hspace{-2pt}speed_{North}}^2 }\] with \(W\hspace{-2pt}speed\) the wind speed.
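The two transformations can be sketched as follows. This is a minimal illustration in Python and not the code used for the report; the function names are hypothetical, and the offset of 273.15 is the standard Kelvin to Celsius conversion.

```python
import math

KELVIN_OFFSET = 273.15  # 0 degrees Celsius expressed in Kelvin


def kelvin_to_celsius(temp_k):
    """Convert a temperature from Kelvin to degrees Celsius."""
    return temp_k - KELVIN_OFFSET


def wind_speed(u_east, v_north):
    """Euclidean norm of the (eastward, northward) wind vector."""
    return math.hypot(u_east, v_north)


# Hypothetical values for one grid cell.
print(kelvin_to_celsius(300.0))
print(wind_speed(3.0, 4.0))
```

The norm discards the wind direction, so two cells with opposite winds of equal strength receive the same wind speed.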

Visualisation and wildfires over time

As a descriptive analysis of the data, and before exploring the different factors, we first plot on a map the number of wildfires (denoted \(CNT\)) and the resulting burnt area (denoted \(BA\)) with respect to time. The objective is to determine which states are the most affected by wildfires and to identify when fires happen the most.

The dataset contains a list of coordinates in the United States. To determine which state each coordinate belongs to, the Python library \(\it{reverse\_geocoder}\) (link) was used. This library returns the closest address for given coordinates. From this address we extracted the name of the state, then summed the values per state and stored them in a dictionary. \(CNT_i\) and \(BA_i\) denote the values of the dictionary for a state \(i\) in the U.S.A. Let us also denote by \(CNT_k\) and \(BA_k\) the values for a given coordinate \(k\).
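The per-state aggregation can be sketched as below. This is only an illustration: the row layout and the `state_of` callable are assumptions, and a toy lookup stands in for the reverse geocoding step so that the example runs without any external library.

```python
from collections import defaultdict


def aggregate_by_state(rows, state_of):
    """Sum CNT and BA over all grid points that fall in the same state.

    rows     : iterable of (lat, lon, cnt, ba) tuples (hypothetical layout)
    state_of : callable mapping (lat, lon) to a state name; in the report
               this role is played by the reverse geocoding library, whose
               wiring is not shown here
    """
    cnt_by_state = defaultdict(float)
    ba_by_state = defaultdict(float)
    for lat, lon, cnt, ba in rows:
        state = state_of(lat, lon)
        cnt_by_state[state] += cnt
        ba_by_state[state] += ba
    return dict(cnt_by_state), dict(ba_by_state)


def toy_state_of(lat, lon):
    """Toy lookup standing in for the reverse geocoding step."""
    return "Texas" if lon > -107.0 else "California"


# Made-up rows, for illustration only.
rows = [
    (31.0, -100.0, 2, 10.0),
    (32.0, -99.0, 1, 5.0),
    (36.0, -119.0, 4, 50.0),
]
cnt_by_state, ba_by_state = aggregate_by_state(rows, toy_state_of)
print(cnt_by_state, ba_by_state)
```

Passing the lookup as a parameter keeps the aggregation independent of the geocoding method, which makes it easy to cache the state of each coordinate since the grid does not change over time.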

To better visualize the data, several adjustments were made. First, the number of wildfires and the burnt area were divided by the total area of the state to make states comparable. This value was then multiplied by \(10^5\) (for \(CNT\)) or \(10^4\) (for \(BA\)) to obtain the number of wildfires per \(10^5km^2\) or the burnt area per \(10^4km^2\) respectively. The resulting numbers range from the order of \(10^{-2}\) to \(10^{3}\), so a log scale was applied to obtain a reasonable color scale across states. The final numbers for the plots are calculated as follows:

\[Final\_CNT_i=\log_2\left(\frac{10^5\left(\sum\limits_{k=coordinate\_0}^{total\_number\_of\_coordinates\_in\_i}CNT_k\right)}{total\_area\_of\_i}+1\right)\]

\[Final\_BA_i=\log_2\left(\frac{10^4\left(\sum\limits_{k=coordinate\_0}^{total\_number\_of\_coordinates\_in\_i}BA_k\right)}{total\_area\_of\_i}+1\right)\]

for \(i\) a state in the U.S.A.

Values close to 0 would map to large negative numbers on a log scale. Hence 1 is added before taking the log, so that the scale starts at 0 and these values do not appear as meaningless outliers.
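As a small worked example of the state-level scaling, the sketch below evaluates the formula for a single state. The function name and the numbers are hypothetical; only the structure (scale, divide by area, add 1, take the base-2 log) follows the formulas above.

```python
import math


def state_colour_value(total, state_area_km2, scale):
    """log2(scale * total / area + 1), the quantity used for the colour scale.

    scale is 1e5 for CNT and 1e4 for BA in the formulas above.
    """
    return math.log2(scale * total / state_area_km2 + 1.0)


# Hypothetical totals for a single state, chosen only for illustration.
area_km2 = 500_000.0
final_cnt = state_colour_value(total=120, state_area_km2=area_km2, scale=1e5)
final_ba = state_colour_value(total=3_000.0, state_area_km2=area_km2, scale=1e4)
print(final_cnt, final_ba)

# A state without any recorded fire maps to 0 instead of minus infinity.
print(state_colour_value(0, area_km2, 1e5))
```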

In addition, we plotted the number of wildfires or the burnt area at each location as red scatter points whose size follows the value. To adjust the sizes, the log scale was again applied. The numbers were obtained as follows:

\[Local\_CNT_k=4\log_2\left(CNT_k+1\right)\]

\[Local\_BA_k=2\log_2\left(BA_k+1\right)\]

for \(k\) a given coordinate of the dataset.
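The marker sizes follow the same idea at the level of a single coordinate. A minimal sketch, with a hypothetical function name and made-up input values:

```python
import math


def marker_size(value, factor):
    """factor * log2(value + 1): factor is 4 for CNT and 2 for BA above."""
    return factor * math.log2(value + 1.0)


# Hypothetical values at one coordinate.
local_cnt = marker_size(7, factor=4)
local_ba = marker_size(1023.0, factor=2)
print(local_cnt, local_ba)
```

The different factors compensate for the fact that burnt areas span several more orders of magnitude than fire counts, so that both kinds of markers stay readable on the same map.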

Several interactive maps with the chosen scaling methods were made in Python but, due to technical issues (ref (1AD)), deploying the maps with an external link was not possible. It is however still possible to run the interactive maps locally from the files \(\it Visualisation\_general.ipynb\) and \(\it Visualisation\_specific.ipynb\), provided that the required libraries are installed. Figures 3, 5, 6, 7 and 8 are animated plots of the interactive map for different time frames and modes (\(CNT\) or \(BA\)).

Figure 3 gives an overview of one option that can be chosen for the plot. The color scale shows the burnt area for each state, and the scattered circles show the number of wildfires at each coordinate.

Figure 3: Overview of the interactive map, BA in colors and CNT in scatter circles

One observation from these plots is that the number of wildfires does not always match the burnt area. For instance, in June 1993, Texas (the state at the very bottom, in the middle of the map) had about \(10^3\) acres burnt per \(10^4km^2\) but, as shown in Figure 4, the number of wildfires was comparatively small, judging by the number and the size of the red dots in Texas.

Figure 4: Plot in 06/1993, BA in colors and CNT in scatter circles

Figure 5 displays, for each month, the mean number of wildfires over all years. One noticeable aspect is that the states at the edge of the country tend to have the highest numbers of wildfires. A phenomenon similar to the “eye of the storm” appears at the center and moves around. At the beginning of spring, the states on the east side are the most affected; as time goes by, the “eye of the storm” moves to the east, and in the end the states in the west are the most affected.

Figure 5: Animated map of mean of number of wildfires from 1993 to 2015 at each month

Figure 6 shows, for each month, the mean burnt area over all years. Compared to Figure 5, we recover more or less the same phenomenon, with more dramatic changes. For example, the value for Nevada grows from about \(6.88\) acres per \(10^4km^2\) in March to \(5.80 \times 10^3\) acres per \(10^4km^2\) in August, while the number of cases grows from \(3.02\) to \(67.62\) over the same period. By the end of summer, the impact of wildfires is huge on the west side of the country. In August, almost half of the land has more than \(2^8\) acres burnt per \(10^4km^2\).

Figure 6: Animated map of mean of burnt area from 1993 to 2015 at each month

Figure 7 shows, for each year, the mean number of wildfires from March to September. The evolution is heterogeneous throughout the years but, excluding 2015, the “eye of the storm” phenomenon is apparent, with Kansas and the states around it being almost untouched and the states at the edge being the most affected.

Figure 7: Animated map of mean of number of wildfires at each year

Figure 8 shows, for each year, the mean burnt area from March to September. We recover more or less the same result as in Figure 7, except that the eye moved from Kansas to Illinois. The severity of wildfires seems almost periodic: in 1997, 2004 and 2010 the cases are less severe, whereas in 2000, 2007 and 2012 they are far more severe.

Figure 8: Animated map of mean of burnt area at each year

The observation from Figure 8 can be recovered by plotting the total burnt area (\(\sum\limits_{i=state\_0}^{all\;the\;states}BA_i\)) with respect to the year. Figure 9 shows the evolution of the total burnt area in the U.S. from 1993 to 2015. The numbers indeed appear periodic. In addition, the rolling mean with a window of 4 is slightly increasing, which suggests that wildfires are becoming more severe on average. Figure 10 shows the evolution of the total number of wildfires (\(\sum\limits_{i=state\_0}^{all\;the\;states}CNT_i\)). The rolling mean with the same window oscillates but shows no particular increasing or decreasing trend over time.
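For readers unfamiliar with the smoothing used here, a rolling mean with a window of 4 can be sketched as follows. The choice of a trailing window (the current year and the three preceding ones) is an assumption, since the alignment of the window is not stated, and the yearly totals are made up.

```python
def rolling_mean(values, window=4):
    """Trailing rolling mean; positions without a full window yield None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


# Made-up yearly totals, for illustration only.
yearly_total = [4.0, 8.0, 6.0, 2.0, 10.0, 12.0]
smoothed = rolling_mean(yearly_total, window=4)
print(smoothed)
```

Averaging over four consecutive years dampens year-to-year oscillations, which is why a slow trend is easier to read on the smoothed curve than on the raw totals.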

Figure 9: Total burnt area with respect to time

Figure 10: Total number of wildfires with respect to time

Wildfires and the land cover

Correlation between land covers and wildfires

Figure 11: Correlation plot for the fire occurrences and the different land covers

The next step is to determine whether there is a significant correlation between the different land covers and the occurrence of fires. To do so, we use the data filtered of missing values. As can be seen in Figure 11, the correlation between fires and land covers remains very low. Considering a correlation strong when its absolute value exceeds 0.8, there is no strong correlation among the features, nor between the features and the occurrence of fires.

We can nonetheless note that, although small, the correlation between the occurrence of fires and cropland rainfed herbaceous cover, mosaic cropland, shrubland, grassland, bare areas and water is negative, while the correlation with mosaic natural vegetation, tree broadleaved evergreen closed to open, tree broadleaved deciduous closed to open, tree needleleaved evergreen closed to open, tree needleleaved deciduous closed to open, tree mixed, mosaic tree and shrub, sparse vegetation, tree cover flooded fresh or brackish water, shrub or herbaceous cover flooded and urban areas is positive.

Land cover over time and location

This section addresses the distribution of land covers over time. To do so, the geographical coordinates of Malibu, California have been selected. This area has been heavily impacted by wildfires in recent years. ‘In November 2018, the wealthy coastal enclave of Malibu was engulfed by the Woolsey Fire, which spread to over 96,000 acres of land outside of LA and is now 35% contained. At least two people were pronounced dead in Malibu on Friday’ (Aria Bendix (2018)). Figure 12 shows the exact location of the studied point on the map of the United States.

Figure 12: Location considered in the analysis here is Malibu, CA

Figure 13 shows the distribution of the land covers over time. The proportions are relatively stable and consistent. The predominant cover is the \(urban area\) with 31% of the surface in 1993, and this proportion only increases with time.

Figure 13: Distribution of land covers across time from 1993 to 2015 in Malibu, CA

Although the proportions of the different land covers seem consistent in this particular example of Malibu, this is not always the case. Taking the coordinates shown in Figure 14, in Wisconsin, it can be seen that the proportions change drastically across the years. The percentage of land representing \(tree broadleaved evergreen closed to open\) increases with time until it becomes the predominant one, while \(shrub or herbaceous cover flooded\) decreases drastically, as seen in Figure 15. We will not analyze the reasons and parameters that may have driven this change, but will try to quantify it in the next section.

Figure 14: Location considered in the analysis here is Wisconsin

Figure 15: Distribution of land cover across time in Wisconsin

To access the interactive part, click this link. In this app you will be able to select the longitude and latitude and see the distribution of the land covers.

Predominant Land covers Analysis and Shifts

Here the territory is represented according to its predominant type of land cover. Figure (16) shows this arrangement in 2000: \(water\) dominates the maritime borders, \(tree broadleaved\) is predominant in the east of the country, while the west is dominated by \(tree needleleaved\). New York, Los Angeles and other large cities are easily identifiable by the predominance of urban areas.

Figure 16: Predominant land covers in 2000 by location

To access the interactive part, click this link. In this app you will be able to select the year, and the predominant land cover will be displayed directly on the US map.

The occurrence of fires at a location, in relation to its predominant land cover, is also an interesting analysis.

Figure 17: Mean number of fires in the United States of America in 2000 by location

In Figure (17), the highest occurrence of fires is found in Florida and Georgia in the South East, in New Jersey on the East coast, and all along the West coast. These areas are characterised by the predominance of \(broadleaved evergreen closed to open\) trees. This observation can be explained by the effect of climbing fire: easily inflammable land covers, e.g. \(grassland and shrubs\), act as a ladder to higher land covers with a higher fuel capacity, such as trees. Usually, ‘the forests are more prone to the fire only when there is a particularly low near-surface SM (soil moisture), most likely from moderate to extensive drought’ (Schaefer, Alexander J. and Magi, Brian I. (2019)).

It can also be observed that proximity to urban land cover goes together with the occurrence of wildfires (for instance near New York City and Los Angeles), which suggests that human activity can trigger wildfires. Another commonly known trigger is the presence of dry thunderstorms with a high activity of cloud-to-ground flashes. This trigger is usually the cause of the higher frequency of big wildfires on the Pacific Coast.

The combination of \(grassland\) with \(needle-leaved trees\) might be dangerous, as can be seen in Figure (16) and Figure (17).

The presence of \(needle-leaved trees\) is crucial for wildfires, since the trees are close to each other and the dead branches on the ground, together with sap, provide enough fuel for the wildfire to spread further. Moreover, the land cover of \(deciduous trees\) can be especially dangerous in early spring. The article by Barros et al. (Barros, Ana M. G. and Pereira, José M. C. (2014)), which studies fire selectivity in Portugal, suggests that the selectivity for wildfire counts is higher for \(shrublands, grasslands and conifers\), but lower for agricultural areas such as \(cropland\). This suggestion seems to be confirmed in the US by the early observations from Figures (16) and (17).

Figure 18: Number of shifts of predominant land cover by location

To analyze the variation, we create a vector whose values start at zero for every location and increase by one each time the predominant land cover changes from one month to the next. It is first interesting to note that all locations change their predominant land cover at least once: in Figure (18), the shift values start at 1. They also do not exceed 5, which is still relatively low considering that the measurement covers 22 years. Note that we will not investigate the reasons behind these changes in this project.

Land covers and meteorological correlation
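The shift counter described above can be sketched as follows for a single location. The data layout (one list of land cover proportions per time step) and the tie-breaking rule are assumptions made for the illustration.

```python
def predominant(proportions):
    """Index of the land cover with the largest proportion (first one on ties)."""
    return max(range(len(proportions)), key=lambda i: proportions[i])


def count_shifts(series):
    """Number of times the predominant land cover changes between
    consecutive time steps for one location."""
    labels = [predominant(p) for p in series]
    return sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)


# Made-up proportions of three land covers over four time steps.
series = [
    [0.50, 0.30, 0.20],
    [0.40, 0.45, 0.15],
    [0.35, 0.50, 0.15],
    [0.60, 0.30, 0.10],
]
shifts = count_shifts(series)
print(shifts)
```

Note that a location whose two main land covers have nearly equal proportions can flip back and forth, so a shift count measures changes of rank rather than the magnitude of the change.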

We want to understand the correlation between land covers and meteorological variables. To do so, we first clean the data, rename the variables by their description so that the results are easier to analyze, and then compute the correlation between the variables. The problem is that, when trying to output the matrix, we get a warning saying that the matrix is too big. To get a glimpse of which variables may be highly related, we output the correlation between a land cover and a meteorological variable only when its absolute value exceeds a manually set threshold.

[1] "cor( 11 , 7 )= -0.722896324256492"
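The filtering step that produces this kind of output can be sketched as follows. The output shown in the report comes from R; this Python version is only an illustration of the logic, with hypothetical variable names and made-up data.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences.

    Raises ZeroDivisionError if one of the sequences is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def strong_pairs(land_covers, meteo, threshold=0.6, corr=pearson):
    """Return (lc_name, met_name, r) for every pair with |r| > threshold."""
    hits = []
    for lc_name, lc in land_covers.items():
        for met_name, met in meteo.items():
            r = corr(lc, met)
            if abs(r) > threshold:
                hits.append((lc_name, met_name, r))
    return hits


# Made-up columns, for illustration only.
land_covers = {"lc_a": [0.1, 0.2, 0.3, 0.4], "lc_b": [0.4, 0.1, 0.3, 0.2]}
meteo = {"m1": [4.0, 3.0, 2.0, 1.0]}
hits = strong_pairs(land_covers, meteo, threshold=0.6)
print(hits)
```

Comparing the absolute value with the threshold is what allows a strongly negative correlation to be reported as well.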

Here, we choose a threshold of 0.6 and only one correlation is displayed. This means that all the others are smaller than 0.6 in absolute value. The correlation between shrubland and the surface net thermal radiation is -0.72. This seems plausible: the more radiation is emitted by the surface, the less shrub we get. But the results are not as good as expected. In fact, when printing all the correlations (threshold = 0), we notice that the values are very small. This can mean either that the variables are not correlated or that the dependence is not linear. The first option seems less likely, so we compute the correlation with the Spearman method to see whether the relation between the variables is monotonic.

[1] "cor( 11 , 7 )= -0.626147167673006"
[1] "cor( 16 , 8 )= 0.691392764491666"
[1] "cor( 18 , 8 )= 0.6009781819297"
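To make the difference between the two methods concrete, the sketch below computes the Spearman coefficient as the Pearson correlation of the ranks, which is its standard definition. The helper names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1, ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# A monotonic but non-linear relation: y = x^3.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v ** 3 for v in x]
print(pearson(x, y), spearman(x, y))
```

On such data the Spearman coefficient reaches 1 while the Pearson coefficient stays below 1, which is the reason for switching methods when the dependence is suspected to be monotonic but not linear.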

With the Spearman method, land cover variable 11 (shrubland) and the surface net thermal radiation are again correlated, as before. We also find that water and urban areas are correlated with surface pressure. A tentative explanation is that the pressure in water is higher than the atmospheric pressure and increases with depth, and that air pollution explains the positive correlation between urban land and surface pressure. To visualize these results and try to show more, we regroup the data so that a correlation matrix can be computed, again with the Spearman method. The regrouping consists in taking land cover variables that share common characteristics and summing them together.

Figure 19: Correlation heatmap after regrouping the land covers with meteorological variables.

This technique is useful to represent our previous result. The figure also shows that evaporation is negatively correlated with trees, meaning that the more trees there are, the less evaporation there is. For a future project, one could look for relations between other land covers and meteorological data by drawing scatter plots and fitting a curve that approximates the data, in order to detect non-monotonic or non-linear dependence.

Spreading fire based on the type of land cover

First, to analyze the spread of a fire, let us look at the distribution of the number of fires in a grid cell during a month.

Most of the values are close to 0, and because of the outliers the distribution cannot be seen very clearly. We thus filter the outliers to have a better view of the distribution.

Now, since the aim of this part is to predict the spread of a fire given the land cover data, let us plot the distribution of the number of fires according to each land cover variable.

Most of the plots show a distribution that resembles a Poisson distribution, or a power law in certain cases.

In this section, we try to model whether there was a fire in a grid cell in a given month, given the land cover composition of the cell. The difficulty is that the data records the number of fires in the cell. One option is the zero-inflated Poisson regression, which is used for count data with excess zeros and overdispersion, a good description of our data. It combines a Poisson count model with a logit model for the excess zeros.


Call:
zeroinfl(formula = CNT ~ lc1 + lc2 + lc3 + lc4 + lc5 + lc6 + 
    lc7 + lc8 + lc9 + lc10 + lc11 + lc12 + lc13 + lc14 + lc15 + 
    lc16 + lc17 + lc18, data = df)

Pearson residuals:
     Min       1Q   Median       3Q      Max 
 -2.6002  -0.6556  -0.4760  -0.1269 164.8916 

Count model coefficients (poisson with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.4275     0.1150  -3.718 0.000201 ***
lc1           3.8936     0.1317  29.561  < 2e-16 ***
lc2           1.7546     0.1152  15.234  < 2e-16 ***
lc3           0.1621     0.1427   1.136 0.255880    
lc4           2.5901     0.1170  22.144  < 2e-16 ***
lc5           2.3983     0.1148  20.897  < 2e-16 ***
lc6          -7.7769     0.4706 -16.526  < 2e-16 ***
lc7           2.6995     0.1149  23.490  < 2e-16 ***
lc8           2.3755     0.1158  20.514  < 2e-16 ***
lc9           1.0370     0.1172   8.845  < 2e-16 ***
lc10          4.5284     0.1232  36.762  < 2e-16 ***
lc11          1.5169     0.1153  13.158  < 2e-16 ***
lc12          1.9933     0.1155  17.256  < 2e-16 ***
lc13         -1.5373     0.1658  -9.271  < 2e-16 ***
lc14          2.7590     0.1212  22.764  < 2e-16 ***
lc15          2.5966     0.1282  20.260  < 2e-16 ***
lc16          5.8753     0.1176  49.973  < 2e-16 ***
lc17         -0.3846     0.1310  -2.937 0.003314 ** 
lc18          1.4079     0.1162  12.121  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -9.8587     0.4163  -23.68  < 2e-16 ***
lc1           9.8726     0.4567   21.62  < 2e-16 ***
lc2          11.9956     0.4168   28.78  < 2e-16 ***
lc3          12.7570     0.4947   25.79  < 2e-16 ***
lc4           9.9462     0.4243   23.44  < 2e-16 ***
lc5          10.0305     0.4162   24.10  < 2e-16 ***
lc6          18.0285     1.0028   17.98  < 2e-16 ***
lc7           8.0604     0.4164   19.36  < 2e-16 ***
lc8           9.4954     0.4201   22.60  < 2e-16 ***
lc9          11.0770     0.4201   26.37  < 2e-16 ***
lc10          1.2508     0.4685    2.67  0.00759 ** 
lc11         10.7596     0.4172   25.79  < 2e-16 ***
lc12         10.8828     0.4172   26.08  < 2e-16 ***
lc13         16.6698     0.5593   29.81  < 2e-16 ***
lc14          8.6888     0.4388   19.80  < 2e-16 ***
lc15          9.2563     0.4608   20.09  < 2e-16 ***
lc16          7.3972     0.4386   16.87  < 2e-16 ***
lc17         12.1465     0.4312   28.17  < 2e-16 ***
lc18         11.4647     0.4180   27.43  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 43 
Log-likelihood: -1.138e+06 on 38 Df

Above is a block of output containing the Poisson regression coefficients for each of the variables along with the standard errors, z-scores and p-values. A second block follows with the inflation model, which contains the logit coefficients for predicting excess zeros. All of the predictors in both the count and the inflation parts are statistically significant (all p-values are very small), except for land cover 3 in the count model. In other words, the null hypothesis that a coefficient is equal to 0 is rejected for all the other coefficients, which suggests that the land covers carry information about both the fire counts and the excess zeros.

Wildfires and meteorological factors
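To make the two parts of the model concrete, the sketch below writes out the zero-inflated Poisson probability mass function and the two link functions named in the output (log link for the count part, logit link for the inflation part). It is an illustration of the distribution, not a re-implementation of the fitting routine; the parameter values are hypothetical.

```python
import math


def poisson_rate(eta):
    """Log link of the count part: rate = exp(linear predictor)."""
    return math.exp(eta)


def inflation_probability(eta):
    """Logit link of the inflation part: probability of a structural zero."""
    return 1.0 / (1.0 + math.exp(-eta))


def zip_pmf(k, lam, pi):
    """P(Y = k) for a zero-inflated Poisson with rate lam and
    structural-zero probability pi (k is a non-negative integer)."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def zip_mean(lam, pi):
    """Expected count under the zero-inflated Poisson."""
    return (1.0 - pi) * lam


# Hypothetical parameter values, for illustration only.
lam, pi = 2.0, 0.3
p_zero = zip_pmf(0, lam, pi)
total = sum(zip_pmf(k, lam, pi) for k in range(60))
print(p_zero, total, zip_mean(lam, pi))
```

A positive coefficient in the inflation part therefore increases the probability of a structural zero, while a positive coefficient in the count part increases the expected number of fires among the cells that are not structural zeros.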

In this section, we look for meteorological factors that could potentially trigger a wildfire. To do so, we plot a Pearson correlation heatmap. The first step is to extract a subset of the initial dataset that is more representative of wildfire conditions. As a fire is quite a rare event, no link could be found by considering the entire population (the correlations would be around 0).

To build the subset, we only consider area-month observations that have at least a certain number of wildfires and a burnt area above a certain threshold. In our example, we keep rows with at least 60 wildfires (\(CNT \geq 60\)) and at least 5000 acres of aggregated burnt area (\(BA \geq 5000\), approximately \(20km^2\)) in the current month.
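The subsetting rule can be sketched as follows. The dictionary-based row layout is an assumption made for the illustration; the default thresholds are those of the example above, and rows with a missing value are dropped.

```python
def select_severe(rows, min_cnt=60, min_ba=5000.0):
    """Keep rows with CNT >= min_cnt and BA >= min_ba.

    Rows with a missing value (None) in either column are dropped.
    """
    return [
        row
        for row in rows
        if row["CNT"] is not None
        and row["BA"] is not None
        and row["CNT"] >= min_cnt
        and row["BA"] >= min_ba
    ]


# Made-up rows, for illustration only.
rows = [
    {"year": 1993, "month": 6, "CNT": 75, "BA": 12000.0},
    {"year": 2000, "month": 8, "CNT": 61, "BA": 300.0},
    {"year": 2007, "month": 7, "CNT": None, "BA": 9000.0},
    {"year": 2015, "month": 4, "CNT": 60, "BA": 5000.0},
]
selection = select_severe(rows)
print([row["year"] for row in selection])
```

Both conditions must hold at once, so a cell with many small fires or a single very large fire is excluded from the subset.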

We check whether the period in years of the subset matches that of the initial dataset, i.e. 1993 to 2015. We see that it does.

range(selection$year)
[1] 1993 2015

As a result, a correlation heatmap will allow us to distinguish certain factors that could favor a wildfire (Figure 20).

Figure 20: Correlation heatmap after subsetting (explained above) with meteorological variables.

Since we are looking for meteorological risk factors, we may only consider the first two rows or columns of the heatmap.

First, we see that no pair of variables is strongly correlated: the highest absolute value is 0.43. The number of wildfires and the month are negatively correlated. This can be explained by the fact that months range from March to September and that there are more fire occurrences in March and April than in August and September. The small values suggest that the meteorological causes of a wildfire are multifactorial.

An interesting observation is that \(CNT\) and \(BA\) are negatively correlated, meaning that the more area is burnt, the fewer wildfires there are. Most of the time, the signs of the correlations in the first two rows differ between these two variables. For example, the higher the temperature and solar radiation, the more area is burnt but the fewer wildfires there are. This can be explained by the fact that our subset contains substantial wildfires: there are not many of them, but they are destructive.

Another variable worth commenting on is precipitation. For an area to burn, there must be no rain and no high humidity, hence the negative correlation between \(BA\) and precipitation. The positive correlation with \(CNT\) could be explained by unstable weather conditions. The same holds for wind speed: the higher it is, the more unstable the air masses and the greater the risk of a natural disaster.

These correlations should nonetheless be interpreted with care, since they come from a subset and are not very high in absolute value.

Next, a map of the US represents the areas considered in the subset, regardless of months and years (Figure 21).

Figure 21: US map representing the areas considered in the subset (explained above), regardless of months and years. This means that if a particular area appears at least at one time in the subset, it will be shown by a red square on the map.

The areas represented are few in number and are located where forest cover is the most extensive. For example, there are almost no red squares in the central region of the United States, as this area is mostly non-forest land.

For other thresholds, an interactive app is available here. Instructions and explanations are displayed in the app. It produces the same results as above, but with the thresholds for \(CNT\) and \(BA\) entered by the user.

Subgroup analysis

In the preceding analysis, we have seen that many of the fires arise in a small subset of geographical gridpoints. In other words, there is considerable heterogeneity between the geographical gridpoints. Furthermore, in the interactive app which explored meteorological correlations, we found that meteorological variables were more strongly correlated with the number of wildfires in subgroups of geographical gridpoints with a large number of fires. This motivated us to conduct an explicit subgroup analysis by employing quantile regression. We employed the linear quantile regression model \[ Q_Y(\tau\mid X) = a_0(\tau) + b_0(\tau)X \] where \(Y\) is the outcome (number of wildfires (\(N\)) or the burnt area (\(A\))), \(X\) is the exposure covariate (risk factor) and \(Q_Y(\tau\mid X)\) is the conditional quantile of \(Y\) at level \(\tau\) for exposure level \(X\). This model allows us to explore how the effect of \(X\) varies across strata of \(\tau\). In Figure 22 and Figure 23, we fit the quantile regression model with \(X=\) temperature for a random sample of 10 000 geographical gridpoints from the year 2015.

Figure 22: Quantile regression for the effect of temperature on the number of wildfires

Figure 23: Quantile regression for the effect of temperature on burnt area

Whereas the association between temperature and wildfire outcomes (i.e. number of wildfires and burnt area) is not very strong in the marginal analyses conducted earlier (unconditional on \(\tau\)), we observe a very clear association within strata given by percentiles of number of wildfires or burnt area. This indicates that temperature has a heterogeneous effect on wildfires: higher temperatures appear to increase the number and extent of wildfires in areas that are prone to them. As a negative control, we have repeated the analysis with the exposure \(X\) being the proportion \(W\) of land cover constituted by water in Figure 24 and Figure 25.

Figure 24: Quantile regression for the effect of water land cover on the number of wildfires

Figure 25: Quantile regression for the effect of water land cover on burnt area

The negative control shows what we expect: areas which are susceptible to wildfires experience substantially fewer of them the larger the proportion of land cover that is constituted by water. These illustrations highlight the power of quantile regression in elucidating heterogeneous covariate effects across units.
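The linear quantile regression model used in this section can be illustrated with a small, self-contained Python sketch. It is not the code behind our figures: it fits \(a_0(\tau)\) and \(b_0(\tau)\) by brute force, using the fact that with a single covariate some minimiser of the check loss passes through two data points, which is only practical for small samples. The function names and the synthetic data are assumptions made for illustration; for real data one would rely on a dedicated quantile regression library.

```python
def check_loss(residual, tau):
    """Check (pinball) loss that quantile regression minimises."""
    return tau * residual if residual >= 0 else (tau - 1) * residual


def fit_quantile_line(x, y, tau):
    """Return (a, b) minimising the summed check loss of y - (a + b * x).

    With one covariate, some optimal line passes through two data points
    with distinct x, so for small samples we can simply try every such pair.
    """
    best = None
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            if x[i] == x[j]:
                continue  # a vertical pair does not define a line
            b = (y[j] - y[i]) / (x[j] - x[i])
            a = y[i] - b * x[i]
            loss = sum(check_loss(yk - (a + b * xk), tau) for xk, yk in zip(x, y))
            if best is None or loss < best[0]:
                best = (loss, a, b)
    return best[1], best[2]


# Synthetic data (not ours): the spread of the outcome grows with temperature,
# so the slope should increase with the quantile level tau.
temperature = [float(t) for t in range(1, 11) for _ in range(5)]
fires = [t * k / 4 for t in range(1, 11) for k in range(5)]

slopes = {tau: fit_quantile_line(temperature, fires, tau)[1] for tau in (0.1, 0.5, 0.9)}
print(slopes)
```

On this synthetic sample, the fitted slope grows with \(\tau\): the covariate has almost no effect at low quantiles and a strong effect at high quantiles, which is the kind of heterogeneity that a single marginal correlation cannot show.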

The strong association between temperature and wildfires at large quantiles of the wildfire outcomes prompts us to consider how the yearly mean temperature has evolved with time, as this may tell us about the risk of wildfires in the future. In Figure (??), we observe a weakly rising trend (we have chosen vertical axis limits to start above zero to highlight variations in temperature). We therefore remark that we may experience an exacerbation of wildfires in the future, if temperatures continue to rise.
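One simple way to quantify such a trend is to average the temperature per year and fit a least-squares line through the yearly means. The sketch below is a hedged illustration of that idea in Python; the helper names and the observations are invented, and it is not the procedure used to produce our figure.

```python
from collections import defaultdict


def yearly_means(records):
    """Average temperature per year from (year, temperature) pairs."""
    totals = defaultdict(lambda: [0.0, 0])
    for year, temp in records:
        totals[year][0] += temp
        totals[year][1] += 1
    return {year: s / n for year, (s, n) in sorted(totals.items())}


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Invented observations, two per year, in arbitrary temperature units.
records = [
    (2013, 287.0), (2013, 289.0),
    (2014, 288.2), (2014, 288.4),
    (2015, 288.5), (2015, 288.7),
]

means = yearly_means(records)
slope = ols_slope(list(means), list(means.values()))
print(means, round(slope, 3))
```

A positive slope indicates a rising yearly mean; whether it is distinguishable from year-to-year variability would require a proper significance assessment, which this sketch does not attempt.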

Conclusion

In this investigation, we have studied three sets of risk factors for wildfires: (1) the time period, (2) the type of land cover and (3) local meteorological conditions. The interactive map as a function of time let us analyse the influence of the time period within a year and throughout the years. Although the map looked very heterogeneous, some particular trends were observed for specific time periods within a year. Also, the global trend for burnt area was periodic and slightly increasing throughout the years, whereas the trend for the number of wildfires was not. In our exploration of land cover, we found that the proportion of urban areas increased from 1993 to 2005. Fires are often caused by urban activity, so this trend could lead to more wildfires in the future. In our meteorological exploration, we found that the cause-effect relation of meteorological variables on wildfires may be highly multifactorial. The marginal correlations between the number of fires and meteorological variables were small in the population. To disentangle the effect of meteorological variables from the effect of land cover on fires, we examined the joint distribution of these variables. We found that the correlations between meteorological variables and land cover types were weaker than expected, which makes it more feasible to interpret these variables as independent causes of wildfires.

Most of our analysis has targeted marginal associations between risk factors and the number of wildfires or aggregate burnt area (i.e. unconditional on subsets of geographical gridpoints). These correlations were often weak, but grew stronger in subsets of the population with a greater number of wildfires. This motivated us to characterize the heterogeneity in risk of wildfires by conducting subgroup analyses. To do so, we performed quantile regression, which revealed that temperature substantially exacerbated the risk of wildfires in areas which are prone to wildfires. This is in spite of the fact that temperature was only weakly associated with wildfires marginally.

Furthermore, we have seen a weakly rising trend in the temperature over the past years. Coupled with our observation that the proportion of urban land cover has been rising steadily with time, and that human activity is an important cause of wildfires, we note that we may see an increased occurrence of wildfires in the years to come.

Future improvements

An extension of our analysis would be to use our data to make predictions. It would also be advantageous to find a more recent dataset, since ours ends in 2015. With more recent data, the effect of global warming might be more visible, and the correlations found above might appear more distinctly.

Another improvement would be to look at other regions that are subject to wildfires, such as Australia or African countries. This would increase our dataset and help us capture the main factors and causes by comparing the areas (more heterogeneity in our data).

Finally, our data may contain fires that are of criminal origin. In this analysis, we have assumed that the causes are of natural origin. An improvement would be to take this into account and possibly identify those cases which do not interest us in this context.

Alejandra Borunda. 2021. The Science Connecting Wildfires to Climate Change. National Geographic. https://www.nationalgeographic.com/science/article/climate-change-increases-risk-fires-western-us.

André Gabrielli. 2019. Wildfires. National Geographic. https://www.nationalgeographic.org/encyclopedia/wildfires/.

Aria Bendix. 2018. Before-and-After Photos Show the Devastating Destruction in Malibu as the California Wildfires Rage on. Business Insider. https://www.businessinsider.com/california-wildfires-photos-malibu-woolsey-fire-2018-11?r=US&IR=T.

Barros, Ana M. G. and Pereira, José M. C. 2014. Wildfire Selectivity for Land Cover Type: Does Size Matter? PloS One. Vol. 9. Public Library of Science.

Holly Yan, Cheri Mossburg, Artemis Moshtaghian and Paul Vercammen. 2020. California Sets New Record for Land Torched by Wildfires as 224 People Escape by Air from a ’Hellish’ Inferno. CNN. https://edition.cnn.com/2020/09/05/us/california-mammoth-pool-reservoir-camp-fire/index.html.

Mert Ozkan and Ezgi Erkoyun. 2021. Turkish Wildfires Are Worst Ever, Erdogan Says, as Power Plant Breached. Reuters. https://www.reuters.com/world/middle-east/fire-near-turkish-power-plant-under-control-local-mayor-2021-08-04/.

Michael Slezak. 2020. 3 Billion Animals Killed or Displaced in Black Summer Bushfires, Study Estimates. ABC News. https://www.abc.net.au/news/2020-07-28/3-billion-animals-killed-displaced-in-fires-wwf-study/12497976.

ref. 1AD. The given dataset was too large to be imported in Python: it is a compressed RData file that exceeds 300 MB when converted to a CSV file, meaning that it would be similar or bigger in size when imported into Python. To deploy Python applications, one can use Heroku, but the maximum memory given for computation is 512 MB, which is too small with respect to this dataset. One can try to minimize the use of memory, but this would be almost impossible, since at least twice the size of the dataset would be needed as free memory in order to make the different plots (importing the file is needed first, and breaking it down into several dataframe objects would take at least as much memory as the size of the whole dataset). Another approach would be to use R instead and deploy the application on shinyapps.io, because RData is believed to be handled more efficiently with R, both computation-wise and memory-wise.

Sarah Gibbens. 2021. Wildfire Smoke Blowing Across the U.S. Is More Toxic Than We Thought. National Geographic. https://www.nationalgeographic.com/environment/article/wildfire-smoke-blowing-across-country-more-toxic-than-we-thought.

Schaefer, Alexander J. and Magi, Brian I. 2019. Land-Cover Dependent Relationships Between Fire and Soil Moisture. Fire. Vol. 2. Multidisciplinary Digital Publishing Institute.

Steven Reinberg. 2021. Wildfires Cause More Than 33,000 Deaths Globally Each Year. U.S. News. https://www.usnews.com/news/health-news/articles/2021-09-09/wildfires-cause-more-than-33-000-deaths-globally-each-year.

Topher Gauk-Roger, Stella Chan, Jason Hanna and Steve Almasy. 2020. California Wildfires: Fire Chief Says Dozens of Major Blazes Have State in ’Dire Situation’. CNN. https://www.cnn.com/2020/09/08/us/california-fires-tuesday/index.html.

References